Use of virtual database technology for internet search and data integration

ABSTRACT

This invention discloses how Virtual Database Technology can be used to make disparate data appear to be (or act as) the sort of uniform data one expects to find within a single relational database. In particular, we show how to process queries similar to those one might use in a database, even though the underlying data may be missing some of the capabilities that are required by normal databases. Whereas traditional databases require that all the tuples in a table be stored, our approach allows queries over tables where the tuples are generated as required from the data sources, and may not be stored anywhere. We show how such facilities can be used as a new foundation for Internet search.

BACKGROUND

Internet users today have access to a large amount of useful data but there is no effective way to combine the data from different sources. For instance, if you wanted to attend a college football game of your alma mater on the other side of the country, you might find a schedule on the web. However, in order to decide which game is easiest or cheapest to attend, you would have to copy the dates and locations of the games into travel sites in order to see how much it would cost to get to the locations of the games on the dates when they were to be played. If all of this data were in a common database, this activity could be automated as part of the process of answering a single database query involving multiple sets of data. In the example above the query would involve a ‘join” of the football schedule data and airline data.

Web pages are not the only source of useful data. The same argument applies to files and documents that reside on a local computer, e.g., spreadsheets, and even such real time data as network usage or load average statistics. Other network services such as RSS, Chat and Twitter could also provide data that one might wish to be able to combine with other data in a single search query.

The commercial databases available today are certainly capable of representing and storing all of the data above, but that data is simply not available in a form where this option is feasible to implement. In the example above, the airlines almost certainly do have a database containing all of the available flights, but they do not make that data available as a single table. Rather, they provide a web site that requires a user to supply a point of departure, a destination, and a date before responding with a list of flights relevant to that specific query.

Unfortunately, the dominant database technology available today, Relational Databases (ROB) cannot make use of data in this form. RDBs expect to deal with complete tables containing ALL of the data of interest at once.

It seems possible to find a list of all airports and then send queries to all airlines asking about flights from every airport to every other airport on every day for the next six months (or however far in advance they schedule flights). Indeed, [11 is a related patent to this application that teaches this kind of method to generate all the tuples of relations in a virtual database. However, schedules change from day to day, and price and availability data change even more often. An airline might not mind a single user or search engine sending a huge number of queries every few months, but they would certainly not want to deal with such a large number of queries every hour or so.

We describe below an alternative way to use the data that is available in order to answer the same sorts of queries that one might make of a relational database. This method ends up using the data in a way more like the way humans use it. In the example with the football games, the human user did not need a complete airline schedule. He only needed to look up flights from where he lives to the cities where his football team was going to play, and only for the dates on which the team was to play in those cities.

While RDBs impose requirements on data that are unrealistic for much of the data available to computer users today, there is another technology, which we refer to as Virtual Database (VDB) technology, that turns out to be much better for the purpose at hand. This can be viewed as a generalization of RDB technology. The data of interest is represented in “relations”, where we can define a relation as a set of sequences of objects, where every sequence in the set has the same (predetermined) length. These relations are more like the predicates of first order logic than the tables of RDBs. From a computational point of view, there are a number of different operations that might or might not be supported by a particular implementation of such a relation. Most of these operations are assumed (and required) to be supported by RDBs, e.g., there must be a way to add a sequence (row) to an ROB table and there must be a way to generate all of the sequences (rows) of a relation (table). These operations are generally optional In VDB technology. A VDB query optimizer can still be built, but unlike an RDB query optimizer, there are queries that cannot be answered. That is, no algorithm exists to answer them. Never the less, in many practical cases, such as the football example, it is possible to find algorithms, and even efficient ones.

It is worth mentioning that many things people view as computations rather than “data” also make sense as VDB relations. For instance, the “plus” relation can be viewed as the set of triples, x, y, z, such that x+y=z. Although there is no way to generate all such triples, there are methods for generating, e.g., for any given input values x and z, all values for y such that x+y=z. To describe the plus relation one would provide three such algorithms, each generating one column given values for the other two. In addition one could provide a testing operation, but this could be easily deduced from any of the three generating operations. The plus relation never changes, so there would be no operations for adding or removing sequences from the set.

Another way of viewing the football problem is that the airline schedule is a function which takes as input two airports and a date, and returns a set of flights along with times and costs. This function is then to be applied to the results of another function which computes (from no inputs) a set of times and places where a certain football team is to play.

An existing implementation of a VDB may be found at [31—this is a web site where the user manual for the VDB system is published. Unfortunately, these contents are not available in any other publication. [31 provides all the information necessary to use the VDB implementation, including how to define relations, how to annotate relations, how to specify new relation representations including computed relations. The innovations underlying the system of [3] has been described in detail in papers [4, 5, 6]. Both [4] and [5] provide details about how queries written in the language of [3] can be compiled into efficient run-time code. The system described in [3] was also extended to provide event correlation primitives in a related US Patent referenced in reference number [2]. The objective of the present invention is to use the VDB technology to view many Internet data sources as relations. Once the data is viewed as relations, the VDB query language can be used to realize a completely different kind of Internet search. In addition, the present invention also discloses methods to make this kind of Internet search easier for typical end users to access and deploy.

SUMMARY OF INVENTION

VDB abstracts the notion of a relation by allowing programmers to provide computations that could be used to generate and to test the relation. Note that VDB does not require that the relation be fully generable. So, the methods specified do not have to generate all the tuples of a relation. The present invention uses this flexibility to view arbitrary data sources on the Internet to be viewed as VDB relations by providing code to generate the relation or test the relation. Any data source from which one can extract tabular data (even partially) in an automatic way can be viewed as a VDB relation.

In the conception of VDB [3], while computations could be viewed as relations, the computations were assumed to be performed entirely inside the VDB process, using only data internal to that process. This patent covers the extension of VDB technology to computations that communicate with other, external data sources, such as web servers, in order to treat the external data that can be retrieved (or one might as well say “computed”) from those sources as relations. In essence, we will take programs that retrieve or compute data from web sites, possibly requiring input in order to do so, and view those programs as the way to partially generate relations. The inputs and outputs are all columns in the same relation. In some cases there may be different programs that compute different partial results for the same relation, similar to the different programs that compute the different columns of the plus relation.

While VDB technology supports transactional semantics similar to those of traditional RDBs, it can also be used in applications where transactional semantics do not apply. Specifically, in the universe of diverse data sources (such as web pages) over the Internet, there is no control over updates to external data, and no guarantee that the data retrieved from different sites are valid at a later time. And yet the computation involving looking up flights, as described above, could still be very useful in practice.

For example, if we want to specify an airline schedule that is available through a Web form at a particular URL, we could view this data as a table containing columns such as the StartingCity, Destination, Date, TimeOfDeparture, TimeOfArrival, Cost and so on. This table is not fully generable, but the form provides a method to generate the rows of the table given specific values for the StartingCity, Destination, and Date, for example. Once many different data sources are defined as a set of VDB relations, the present this invention is to allow Internet search over these data using the VDB query language. Traditional Internet search, as exemplified by Google, Bing, and Yahoo, use keyword search to list results as a list of URLs. Users must then go through the URLs themselves to understand the contents of the results. If users need to make decisions by comparing values from different URLs or need to perform even the simplest of calculations, the task must be performed manually, and, therefore, becomes tedious and error-prone. In contrast, the present invention allows the full power of first order logic queries to be used to correlate information from different data sources and to perform calculations for the user. Results are presented in the form of a table, which many Internet users find intuitive and convenient to understand and to use.

Furthermore, because queries are complex for typical Internet users to define, the present invention views search queries as reusable objects that can themselves be represented as URLs and indexed by keywords in standard search engines. A reusable search can be restricted to just the required user inputs, thereby making the query as flexible as possible for end users, who can apply the search to their own inputs without understanding how to define the logic of the query. Typical Internet users would locate the reusable searches by providing input keywords (as they currently expect to when they interact with any search engine).

DETAILED DESCRIPTION OF INVENTION Drawing Descriptions

FIG. 1 shows a diagram of multiple heterogeneous data sources.

FIG. 2 shows an example of multiple, heterogeneous data sources.

FIG. 3 shows a flowchart for the process of creating metadata artifacts.

FIG. 4 shows a flowchart for what an embodiment of the present invention.

FIG. 5 a sample of a metadata specification.

FIG. 6 shows a sample search query.

FIG. 7 shows sample result table.

FIG. 8 shows a sample reusable search query.

DETAILED DESCRIPTION

FIG. 1 shows a diagram of multiple heterogeneous data sources 1, 2, 3 and 4 operating over a network 5.

The current state of the art to access data from any of the sources 1, 2, 3, or 4 includes methods for automating such processes as one that simulates a user entering two cities and a date into an airline site and retrieves data on flights from the result. Of course, there are many other possible sources of data, including foreign RDBs, data files (whether found on WWW or the local machine), etc. Our main innovation in the present invention is the idea of treating these computations as relations (or as ways to partially generate relations) in a VDB. Once a number of such diverse data sources (such as data sources 1, 2, 3, and 4) are so represented as relations, it is possible to implement new search services over this data using the VDB's query language.

When users have specific search questions that involve information from multiple data sources (web pages), such search facilities will prove to be far superior to keyword-based search ala Google or Bing. This is because the new search algorithms exploit the underlying relational structure and semantics of relations by using the power of a query language to integrate data from different sources.

In discussing some of the examples below we use the Ap5 language [3], which is an extension of the Common Lisp programming language, and uses the same syntax as Common Lisp. The examples should be comprehensible to anyone with a programming background. Suppose we have implemented the function in a Common Lisp program:

-   -   get-amtrak-cached (from to year month day)         where from and to are strings corresponding to train station         codes (or names) and year, month and day are numbers encoding a         date, which accesses the Amtrak web site and returns a list of         entries of form (departure arrival cost), where departure and         arrival are time strings, such as “22:35” and cost is a numeric         number of dollars. (The term “cached” indicates that this         particular function saves the results retrieved from the web         site and reuses them without accessing the web site again if         they are not considered too old.)

That function could be used to define the following VDB relation:

-   -   (amtrak-sched fromstation fromtime tostation totime day month         year cost)         as described in FIG. 5.

For a complete understanding of this example the reader is referred to (31. (This reference should be described) We describe here only a few general features.

-   -   :nonatomic t         means that this relation does not support transactional         semantics (this should be defined). This is related to     -   :type-enforcements (:none :none :none :none :none :none :none         :none)         which means that the types specified by     -   :types (string#fromstation string#fromtime         are not to be enforced. Enforcing them would require         transactional semantics.     -   size ((input output input output input input input output) 9)         means that we estimate that the function will, on average,         return 9 outputs for a given set of inputs. The inputs         correspond to the arguments of the function and the outputs to         the results. This is used for query optimization. Similarly, the         number “1e6”, written:     -   leS         at the end of the example is an estimate (one million, in         floating point scientific notation) of the time required to call         this function. The units are really only relative, but the         simple computations tend to have estimates on the order of 1, so         this means that calling the function is much more time consuming         than simple computations.     -   :generator ((simplemultiplegenerator . . . ))         describes how the function is used to generate time and price         data from station and date inputs.

We refer to the descriptions of a data source as relations as a VDB Metadata Specification for that source. FIG. 2 shows an example of multiple, heterogeneous data sources 1, 2, 3, and 4 described by separate metadata artifacts MD1, MD2, MD3, and MD4 respectively. FIG. 2 also shows a computer system 6 used to execute an embodiment of the present invention 7. All of these components operate over the network 5.

It is important to note that the locations of the VDB Metadata Specifications MD1, MD2, MD3, and MD4 do NOT have to be the same as the location of the data for data sources 1, 2, 3, and 4 respectively. This point is of great significance because we do not want to wait for data providers to change their web pages in order to be able to access their data.

In fact, it is possible to post metadata specifications MD1, MD2, MD3, and MD4 on the Internet in order to share it with others. When one person creates metadata for a web site, and makes it available on a public web page, it becomes possible for everyone, everywhere to treat the data at that site in a relational manner using the VDB technology (for example, when users search for “Amtrak Metadata” or “Inflation Metadata”). For that matter, there is no reason that there cannot be more than one relational view of the same data, possibly created by different people at different times and residing in the same virtual database.

FIG. 3 shows a flowchart for the process of creating metadata artifacts MD1, MD2, MD3, MD4 for data sources 1, 2, 3, and 4. Specifying the metadata for relations entails: specification of the location 9 of the data source; and the following components of a metadata specification:

-   -   Specify relation name 10     -   Specify number of relation columns (relation arity) 11     -   Identify column names 12     -   Identify column types 13     -   Write user documentation 14     -   Supply code for relation tester 15     -   Supply code for relation generator 16     -   Supply annotations as relevant 17     -   Supply caching time limit 18

All the elements described in FIG. 3 are assembled into a single metadata specification for the relation. FIG. 5 a sample of a metadata specification MD1 submitted to an embodiment of the present invention 7 over the network 5. Once the metadata specification, such as MD1, is processed by the embodiment of the present invention 7, one could enter a query such as follows:

-   -   (listof (x y z) s.t. (amtrak-sched “LAX”×“LVS” y day month year         z))         to find results similar to calling the function     -   (get-amtrak-cached “LAX” “LVS” year month day)

However, many other queries would also be possible. For instance, if we had other relations describing shows in Las Vegas, such as

  (show-category show category) (show-time show year month day start end) (show-price show price) where show is the name of the show, and category is a category such as “comedy”, then with the help of a few easily defined computational relations we could construct a large number of useful queries, such as

  (listof (depart arrive day mon year tcost show scost start end cost) s. t.  (and   (amtrak-sched “LAX” depart “LVS” arrive day mon  year tcost) (show-category show “comedy”)   (show-price show scost)   (show-time show year month day start end)   (time> start arrive)   (+ scost tcost cost))) which finds comedy shows and trains from Los Angeles to Las Vegas on the same days as the shows, where the train arrives before the shows start, along with the cost of the show, the train and the sum of the two. It would be easy to add another condition to eliminate results with total cost over $100 or to require that the show start at least 2 hours later than the train arrives, or at most 4 hours later, etc.

FIG. 4 shows a flowchart for what an embodiment of the present invention 7 would do with a search query 19. If no algorithm is found for the search query 19, a compile time error 20 is reported. If one or more algorithms are found, the optimal algorithm 21 is selected and executed. Execution of the optimal algorithm 21 results in the execution of the testing code 22 and/or the generating code 23 for various virtual relations. At the end of successful execution, a results table 24 is displayed.

Once the information from external data sources is described using VDB's framework, one can view many complex search problems as queries over the relations defined in the VDB. The queries so defined provide the ability to integrate data from a diversity of data sources using the single common denomination of relations. FIG. 6 shows a sample search query 19 issued to an embodiment of the present invention 7.

FIG. 7 shows sample result table 24 displayed by an embodiment of the present invention 7 if compiling a sample search query 19 results in an optimal algorithm 21 is successfully executed.

This capability can be used to build innovative new Internet search services and various kinds of data integration applications over heterogeneous data sources. Although any user can utilize this kind of search, one practical challenge is that typical Internet users will lack the knowledge and patience to define queries in the VDB query language. To aid adoption with typical Internet users, we will need to devise methods for users to use queries created by others, and to customize the search queries that others have defined.

Search queries, such as the query shown in FIG. 6, provide search results that are far more useful to end users. However, such search queries would also be much more complicated for users to define. Currently, Internet users are accustomed to getting search results by entering no more input than a set of keywords.

To make it easier for average Internet users to use the described search capabilities, the idea that we propose is that knowledgeable individuals define and save the queries, and organize them in a fashion that allows other Internet users to find the search queries in different situations, Average users are allowed to reuse existing query templates, either using them as designed or tailoring them (e.g., by providing inputs) to their own specific requirements.

Reusable searches are also created as separate web pages visible to the public. Making these pages public makes them accessible to search engines such as Google, Yahoo, and Bing. A reusable search page presents the user with the following:

-   -   A simple form to allow users to provide all required inputs for         the search;     -   An explanation of what the search does; including the output         table that can be obtained by initiating the search;     -   A button to initiate the search;     -   A button to customize the search;

All the other logical details of the query are hidden from the user (unless the user chooses to customize the search). FIG. 8 shows a sample reusable search query 19 restricted to just required inputs in order to simplify use of the query for typical Internet consumers. Search results 19 are displayed if users supply the required inputs and click Search.

The user can also choose to customize a reusable search. In this case, the user will see the search query in its entirety, including all the logical clauses of the query and all the required user inputs. The user can alter the logical clauses of the search query and change the set of inputs if desired. In effect, the user is supported through a graphical user interface (GUI) in using the reusable search as a template to create a new search query.

The search facilities described herein require retrieval of data from all the relevant data sources in real-time over the Internet in order to answer queries. This will invariably present performance challenges, relative to present-day search engines that focus on keyword indexing data that is invariably stored on a single machine.

One strategy to improve the performance of the search queries is to cache data from different data sources. For example, if a user retrieves the airline schedule for Acme Airlines fora given date travelling from City A to City B, that data can be cached so that subsequent queries that need to compute the schedule from City A to City B for the same date can be executed much faster. The key issue with data caches in this application is for the query optimizer to know when the data cache can be used and when it is required that the data be generated anew by accessing the data source over the network. Caching policies are part of the metadata specification for each data source. In particular, we allow metadata specification to indicate the maximum length of time that data can remain in the cache and still be valid. For airline schedules, the schedule data may be good till midnight of that day, when new schedule data is loaded into the database. For certain census data, the data in the cache may be good for years. This really depends on the nature of the data in the data source, and metadata specifications are used to describe the caching policy for the data source. 

1. A method to integrate data available from diverse sources such as static and dynamic web pages, files and documents in different formats, databases, and APIs (Application Program Interfaces) comprising the step of using a logical query language to integrate data access across the diverse sources.
 2. The method of claim 1, comprising the further steps of viewing a data source as a plurality of relations; extracting tuples of the relation from the data source using an automated process; the relation name, number of relation columns, column names, the location of the data source, and the automated process together defined as the metadata specification for each relation.
 3. The method of claim 1, comprising the further step of accessing data available from incomplete tables stored for an indeterminate period of time.
 4. The method of claim 3, wherein the only data access possible from a data source is output specifically related to provided inputs.
 5. The method of claim 1, wherein the data is under transactional control of some process other than the process accessing the data.
 6. The methods of claim 1, wherein the location of a metadata specification is different from the location of the data source, and no cooperation from either the data source provider and owner is required to apply the methods of claim
 1. 7. The method of claim 1, comprising the further step of describing a single data source by a plurality of different metadata specifications.
 8. The method of claim 7, wherein the plurality of metadata specifications can be supplied by a plurality of different authors operating independently, and stored at different locations on the Internet.
 9. The method of claim 2, further comprising the step of storing metadata on the Internet in a manner that such metadata specifications can be indexed by search engines and discovered by users when they supply relevant keywords.
 11. The method of claim 1, further comprising the step of storing search queries on the Internet in a manner that such queries can be indexed by search engines, discovered by search engine users, and then reused by those users. The method of claim 10 further comprising the step of creating a reusable search query that can be simplified for end users by restricting the presentation of the search query to just the user inputs required to execute the query.
 12. The method of claim 10 further comprising the step of customizing any discovered search query by the user who retrieved the search query.
 13. The method of claim 2 further comprising the step of limiting the maximum time period for which the metadata specification for each relation from a data source remains valid.
 14. The methods of claim 1 further comprising the step of storing data from a data source in a local cache; and retrieving the data from the cache if the data has been in the cache for less time than the maximum time period specified in the metadata specification.
 15. The method of claim 2, wherein the location of a metadata specification is different from the location of the data source, and no cooperation from either the data source provider and owner is required to apply the methods of claim
 2. 16. The method of claim 13 further comprising the step of storing data from a data source in a local cache; and retrieving the data from the cache if the data has been in the cache for less time than the maximum time period specified in the metadata specification. 